An Efficient Framework to Extract Parallel Units from Comparable Data

Authors

Lu Xiang

Yu Zhou

Chengqing Zong

Abstract

Since the quality of statistical machine translation (SMT) depends heavily on the size and quality of the training data, many approaches have been proposed for automatically mining bilingual text from comparable corpora. However, existing solutions are restricted to extracting either bilingual sentences or sub-sentential fragments. We instead present an efficient framework that extracts both sentential and sub-sentential units. At the sentential level, we treat parallel sentence identification as a classification problem and extract more representative and effective features. At the sub-sentential level, we draw on the idea of phrase table acquisition in SMT to extract parallel fragments: a novel word alignment model is designed specifically for comparable sentence pairs, and parallel fragments are extracted based on the resulting word alignment. We integrate the extraction tasks at the two levels into a unified framework. Experimental results show that a baseline SMT system achieves significant improvement when this additionally mined knowledge is added.
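To make the sub-sentential step more concrete, the following is a minimal sketch of the standard SMT idea of extracting phrase pairs that are consistent with a word alignment. It is not the authors' alignment model or extraction procedure: the alignment is simply given as input, and the function name `extract_phrase_pairs`, the `max_len` parameter, and the toy sentence pair are all assumptions made for illustration.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Collect phrase pairs consistent with a word alignment.

    src, tgt   -- lists of tokens
    alignment  -- set of (source_index, target_index) links (assumed given)
    max_len    -- hypothetical cap on phrase length, in tokens
    """
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency check: no word inside the target span may be
            # aligned to a source word outside the source span.
            inconsistent = any(
                t_start <= t <= t_end and not (s_start <= s <= s_end)
                for (s, t) in alignment
            )
            if inconsistent:
                continue
            pairs.add((" ".join(src[s_start:s_end + 1]),
                       " ".join(tgt[t_start:t_end + 1])))
    return pairs


# Toy example (invented for illustration, not data from the paper).
src = ["the", "red", "house"]
tgt = ["la", "maison", "rouge"]
alignment = {(0, 0), (1, 2), (2, 1)}
pairs = extract_phrase_pairs(src, tgt, alignment)
```

In this toy case, "red house" pairs with "maison rouge", while "the red" yields no pair because "maison" inside the candidate target span is aligned to "house" outside the source span. This simplified version does not extend phrases over unaligned boundary words.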

Similar resources

Efficiency Measurement of Clinical Units Using Integrated Independent Component Analysis-DEA Model under Fuzzy Conditions

Background and Objectives: Evaluating the performance of clinical units is critical for effective management of health settings. Certain assessment of clinical variables for performance analysis is not always possible, calling for use of uncertainty theory. This study aimed to develop and evaluate an integrated independent component analysis-fuzzy-data envelopment analysis approach to accurate ...

Extracting a Parallel Corpus from Comparable Documents to Improve Translation Quality in Machine Translation Systems

Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...

Using MOLP based procedures to solve DEA problems

Data envelopment analysis (DEA) is a technique used to evaluate the relative efficiency of comparable decision making units (DMUs) with multiple inputs and outputs. It computes a scalar measure of efficiency and discriminates between efficient and inefficient DMUs. It can also provide reference units for inefficient DMUs without consideration of the decision makers’ (DMs) preferences. In this paper, ...

A Recurrent Neural Network to Identify Efficient Decision Making Units in Data Envelopment Analysis

In this paper we present a recurrent neural network model to recognize efficient Decision Making Units (DMUs) in Data Envelopment Analysis (DEA). The proposed neural network model is derived from an unconstrained minimization problem. On the theoretical side, it is shown that the proposed neural network is stable in the sense of Lyapunov and globally convergent. The proposed model has a single-laye...

An Efficient Framework for Extracting Parallel Sentences from Non-Parallel Corpora

Automatically building a large bilingual corpus that contains millions of words is always a challenging task. For low-resource languages in particular, it is difficult to find an existing parallel corpus that is large enough to build a real statistical machine translation system. However, comparable non-parallel corpora are richly available on the Internet, such as in Wikipedia...

Publication date: 2013
